Regresión Logística en R ------------------------ .. code:: r datos = read.csv("Social_Network_Ads.csv", sep = ",", dec = ".", header = T) print(head(datos)) .. parsed-literal:: User.ID Gender Age EstimatedSalary Purchased 1 15624510 Male 19 19000 0 2 15810944 Male 35 20000 0 3 15668575 Female 26 43000 0 4 15603246 Female 27 57000 0 5 15804002 Male 19 76000 0 6 15728773 Male 27 58000 0 .. code:: r str(datos) .. parsed-literal:: 'data.frame': 400 obs. of 5 variables: $ User.ID : int 15624510 15810944 15668575 15603246 15804002 15728773 15598044 15694829 15600575 15727311 ... $ Gender : chr "Male" "Male" "Female" "Female" ... $ Age : int 19 35 26 27 19 27 27 32 25 35 ... $ EstimatedSalary: int 19000 20000 43000 57000 76000 58000 84000 150000 33000 65000 ... $ Purchased : int 0 0 0 0 0 0 0 1 0 0 ... **Variables:** .. code:: r df <- datos[,c("Gender", "Age", "EstimatedSalary", "Purchased")] print(head(df)) .. parsed-literal:: Gender Age EstimatedSalary Purchased 1 Male 19 19000 0 2 Male 35 20000 0 3 Female 26 43000 0 4 Female 27 57000 0 5 Male 19 76000 0 6 Male 27 58000 0 .. code:: r dim(df) .. raw:: html

**Variable y:** .. code:: r print(table(df[,c("Purchased")])) .. parsed-literal:: 0 1 257 143 .. code:: r print(table(df[,c("Purchased")])/nrow(df)) .. parsed-literal:: 0 1 0.6425 0.3575 .. code:: r library(ggplot2) .. code:: r ggplot(data = df) + geom_bar(aes(y = Purchased)) .. image:: output_10_0.png :width: 420px :height: 420px **Variable categórica:** .. code:: r unique(df$Gender) .. raw:: html

'Male'
'Female'

.. code:: r df$Gender <- factor(df$Gender, levels = c(unique(df$Gender)), labels = c(1,0)) print(head(df)) .. parsed-literal:: Gender Age EstimatedSalary Purchased 1 1 19 19000 0 2 1 35 20000 0 3 0 26 43000 0 4 0 27 57000 0 5 1 19 76000 0 6 1 27 58000 0 Ajuste modelo simple: ~~~~~~~~~~~~~~~~~~~~~ .. math:: p = \frac{1}{1+exp \left[- \left(\beta_0+\beta_1 \times Edad \right)\right]} .. code:: r logistic_simple <- glm(Purchased ~ Age, data = df, family = binomial) logistic_simple .. parsed-literal:: Call: glm(formula = Purchased ~ Age, family = binomial, data = df) Coefficients: (Intercept) Age -8.0441 0.1889 Degrees of Freedom: 399 Total (i.e. Null); 398 Residual Null Deviance: 521.6 Residual Deviance: 336.3 AIC: 340.3 .. code:: r print(head(logistic_simple$fitted.values)) .. parsed-literal:: 1 2 3 4 5 6 0.01149707 0.19295760 0.04182836 0.05009205 0.01149707 0.05009205 .. code:: r print(min(logistic_simple$fitted.values)) print(max(logistic_simple$fitted.values)) .. parsed-literal:: [1] 0.009536473 [1] 0.9641822 .. code:: r ggplot() + geom_histogram(aes(x = logistic_simple$fitted.values), binwidth = 0.01)+ scale_x_continuous(labels = scales::percent)+ labs(x = expression(p), y = "Número de observaciones")+ theme_minimal() .. image:: output_19_0.png :width: 420px :height: 420px .. code:: r summary(logistic_simple) .. parsed-literal:: Call: glm(formula = Purchased ~ Age, family = binomial, data = df) Deviance Residuals: Min 1Q Median 3Q Max -2.5091 -0.6548 -0.2923 0.5706 2.4470 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -8.04414 0.78417 -10.258 <2e-16 *** Age 0.18895 0.01915 9.866 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 521.57 on 399 degrees of freedom Residual deviance: 336.26 on 398 degrees of freedom AIC: 340.26 Number of Fisher Scoring iterations: 5 **Coeficientes:** .. code:: r coef <- logistic_simple$coefficients print(coef) .. parsed-literal:: (Intercept) Age -8.0441425 0.1889496 .. math:: exp \left( coeficientes \right) = \frac{p }{1-p } = oddsRatio .. code:: r OR <- exp(coef) # Odds Ratio print(OR) .. parsed-literal:: (Intercept) Age 0.0003209765 1.2079800978 Por cada año de más, la probabilidad de comprar aumenta un 21% aproximadamente. **Grados de libertad:** .. code:: r gl <- logistic_simple$df.null - logistic_simple$df.residual print(gl) .. parsed-literal:: [1] 1 **Diferencia de las desviaciones:** .. code:: r dif_residuos <- logistic_simple$null.deviance - logistic_simple$deviance print(dif_residuos) .. parsed-literal:: [1] 185.3117 **Valor p para bondad de ajuste:** .. code:: r p_value <- pchisq(q = dif_residuos, df = gl, lower.tail = FALSE) print(p_value) .. parsed-literal:: [1] 3.3555e-42 Se rechaza la hipótesis nula, entonces el modelo es significativo a un :math:`\alpha = 0,05`. **Error estándar:** .. math:: SE(\hat{\beta_j}) .. code:: r SE <- sqrt(diag(vcov(logistic_simple)))[2] print(SE) .. parsed-literal:: Age 0.01915181 **Estadístico z:** .. math:: zScore = \frac{\hat{\beta_j}}{SE(\hat{\beta_j})} .. code:: r zscore <- coef[2]/SE print(zscore) .. parsed-literal:: Age 9.865891 **Valor p para el coeficiente:** .. code:: r p_value <- 2*pnorm(abs(zscore), lower.tail = FALSE) print(p_value) .. parsed-literal:: Age 5.85129e-23 El coeficiente Edad es significativo. **Predicción:** .. code:: r predict(logistic_simple, newdata = data.frame(Age = 33)) .. raw:: html 1: -1.80880495351216 .. math:: p = \frac{exp \left(\beta_0+\beta_1 \right)}{1 + exp \left(\beta_0+\beta_1 \times X_1 \right)} .. code:: r exp(predict(logistic_simple, data.frame(Age = 33)))/(1+exp(predict(logistic_simple, data.frame(Age = 33)))) .. raw:: html 1: 0.140782619945824 .. math:: p = \frac{1}{1+exp \left[- \left(\beta_0+\beta_1 \times X_1 \right) \right]} .. code:: r 1/(1+exp(-predict(logistic_simple, data.frame(Age = 33)))) .. raw:: html 1: 0.140782619945824 Ajuste modelo múltiple: ~~~~~~~~~~~~~~~~~~~~~~~ .. math:: p = \frac{1}{1+exp \left[- \left(\beta_0+\beta_1 \times Género +\beta_2 \times Edad + \beta_3 \times Salario \right)\right]} .. code:: r logistic <- glm(Purchased ~ Gender + Age + EstimatedSalary, data = df, family = binomial) logistic .. parsed-literal:: Call: glm(formula = Purchased ~ Gender + Age + EstimatedSalary, family = binomial, data = df) Coefficients: (Intercept) Gender0 Age EstimatedSalary -1.245e+01 -3.338e-01 2.370e-01 3.644e-05 Degrees of Freedom: 399 Total (i.e. Null); 396 Residual Null Deviance: 521.6 Residual Deviance: 275.8 AIC: 283.8 .. code:: r print(head(logistic$fitted.values)) .. parsed-literal:: 1 2 3 4 5 6 0.0007061408 0.0314610656 0.0063340671 0.0132775722 0.0056085336 0.0191141370 .. code:: r print(min(logistic$fitted.values)) print(max(logistic$fitted.values)) .. parsed-literal:: [1] 0.0005440362 [1] 0.9988217 .. code:: r ggplot() + geom_histogram(aes(x = logistic$fitted.values), binwidth = 0.01)+ scale_x_continuous(labels = scales::percent)+ labs(x = expression(p), y = "Número de observaciones")+ theme_minimal() .. image:: output_53_0.png :width: 420px :height: 420px .. code:: r summary(logistic) .. parsed-literal:: Call: glm(formula = Purchased ~ Gender + Age + EstimatedSalary, family = binomial, data = df) Deviance Residuals: Min 1Q Median 3Q Max -2.9109 -0.5218 -0.1406 0.3662 2.4254 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -1.245e+01 1.309e+00 -9.510 < 2e-16 *** Gender0 -3.338e-01 3.052e-01 -1.094 0.274 Age 2.370e-01 2.638e-02 8.984 < 2e-16 *** EstimatedSalary 3.644e-05 5.473e-06 6.659 2.77e-11 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 521.57 on 399 degrees of freedom Residual deviance: 275.84 on 396 degrees of freedom AIC: 283.84 Number of Fisher Scoring iterations: 6 **Coeficientes:** .. code:: r coef <- logistic$coefficients print(coef) .. parsed-literal:: (Intercept) Gender0 Age EstimatedSalary -1.244979e+01 -3.338434e-01 2.369694e-01 3.644119e-05 .. math:: exp \left( coeficientes \right) = \frac{p }{1-p } = oddsRatio .. code:: r OR <- exp(coef) # Odds Ratio print(OR) .. parsed-literal:: (Intercept) Gender0 Age EstimatedSalary 3.918543e-06 7.161659e-01 1.267402e+00 1.000036e+00 - Por cada año de más, la probabilidad de comprar aumenta un 27% aproximadamente, si las demás variables permanecen constantes. Male = 1 Female = 0 - Comparando entre hombres y mujeres, los hombres aumentan la probabilidad de compra aproximadamente 7 veces, si las demás variables permanecen constantes. - Un aumento en el salario no aumenta ni disminuye la probabilidad de compra. **Grados de libertad:** .. code:: r gl <- logistic$df.null - logistic$df.residual print(gl) .. parsed-literal:: [1] 3 **Diferencia de las desviaciones:** .. code:: r dif_residuos <- logistic$null.deviance - logistic$deviance print(dif_residuos) .. parsed-literal:: [1] 245.7297 **Valor p para bondad de ajuste:** .. code:: r p_value <- pchisq(q = dif_residuos, df = gl, lower.tail = FALSE) print(p_value) .. parsed-literal:: [1] 5.487701e-53 Se rechaza la hipótesis nula, entonces el modelo es significativo a un :math:`\alpha = 0,05`, pero existen unos coeficientes no significativos. **Error estándar:** .. math:: SE(\hat{\beta_j}) .. code:: r SE <- sqrt(diag(vcov(logistic)))[2:4] print(SE) .. parsed-literal:: Gender0 Age EstimatedSalary 3.052264e-01 2.637705e-02 5.472858e-06 **Estadístico z:** .. math:: zScore = \frac{\hat{\beta_j}}{SE(\hat{\beta_j})} .. code:: r zscore <- coef[2:4]/SE print(zscore) .. parsed-literal:: Gender0 Age EstimatedSalary -1.093757 8.983922 6.658530 **Valor p para el coeficiente:** .. code:: r p_value <- 2*pnorm(abs(zscore), lower.tail = FALSE) print(p_value) .. parsed-literal:: Gender0 Age EstimatedSalary 2.740618e-01 2.612829e-19 2.765798e-11 Género no es significativo. Edad y salario si son significativos. **Predicción:** .. code:: r predict(logistic, newdata = data.frame(Gender = '0', Age = 33, EstimatedSalary = 42000)) .. raw:: html 1: -3.43311391519167 .. math:: p = \frac{1}{1+exp \left[- \left(\beta_0+\beta_1 \times X_1 + ... + \beta_k \times X_k \right) \right]} .. code:: r 1/(1+exp(-predict(logistic, newdata = data.frame(Gender = '0', Age = 33, EstimatedSalary = 42000)))) .. raw:: html 1: 0.031276448293889 Si se agrega el argumento ``type = "response"`` en la función ``predict()`` el resultados son las probabilidades :math:`p` y lugar los 𝑙𝑜𝑔(𝑜𝑑𝑑𝑠𝑅𝑎𝑡𝑖𝑜). .. code:: r predict(logistic, newdata = data.frame(Gender = '0', Age = 33, EstimatedSalary = 42000), type = "response") .. raw:: html 1: 0.031276448293889